In this paper, we present an introduction to a collaborative filtering machine learning project based on the Book-Crossing recommendation database. Our goal is to describe the three main tables of this database, namely "user", "rating" and "book", and to show how they can be used to build a personalized book recommendation system using collaborative filtering techniques.
The "user" table contains information about the users, such as a unique identifier, geographic location, and age. This helps us understand users' profiles and behaviors when interacting with books.
The "rating" table contains the ratings users have given to books, on a scale from 0 to 10. This data captures users' preferences for book content.
The "book" table contains information about individual books, including a unique identifier (the ISBN), title, author, year of publication, and publisher. This information can be used to understand the characteristics of the most popular books among users.
The issue with many book recommendation systems is that they rely on basic collaborative filtering or content-based approaches, which can lead to limited recommendations that fail to capture a user's preferences accurately. In addition, a lack of diversity in the recommended books may cause user dissatisfaction and reduce the system's effectiveness.
To address these issues, researchers aim to develop more sophisticated recommendation algorithms that capture users' preferences and recommend books aligned with their interests. These algorithms may include machine learning models that use natural language processing to analyze the content of books and identify their themes and topics. Incorporating user feedback and integrating social network analysis may further help create more personalized and diverse recommendations.
Overall, the goal is to develop book recommendation systems that provide accurate and diverse recommendations, enhancing users' reading experience and increasing their engagement with the platform.
import pandas as pd
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
palette = sns.color_palette("Set2")
import warnings
warnings.filterwarnings('ignore')
user=pd.read_csv("data/Users.csv")
book=pd.read_csv("data/Books.csv")
rating=pd.read_csv("data/Ratings.csv")
removed_books = 0
removed_ratings = 0
removed_users = 0
print('Number of books: ', book.shape[0])
book.head()
Number of books: 271360
| | ISBN | Book-Title | Book-Author | Year-Of-Publication | Publisher | Image-URL-S | Image-URL-M | Image-URL-L |
|---|---|---|---|---|---|---|---|---|
| 0 | 0195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... |
| 1 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... |
| 2 | 0060973129 | Decision in Normandy | Carlo D'Este | 1991 | HarperPerennial | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... |
| 3 | 0374157065 | Flu: The Story of the Great Influenza Pandemic... | Gina Bari Kolata | 1999 | Farrar Straus Giroux | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... |
| 4 | 0393045218 | The Mummies of Urumchi | E. J. W. Barber | 1999 | W. W. Norton & Company | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... |
print('Number of ratings: ', rating.shape[0])
rating.head()
Number of ratings: 1149780
| | User-ID | ISBN | Book-Rating |
|---|---|---|---|
| 0 | 276725 | 034545104X | 0 |
| 1 | 276726 | 0155061224 | 5 |
| 2 | 276727 | 0446520802 | 0 |
| 3 | 276729 | 052165615X | 3 |
| 4 | 276729 | 0521795028 | 6 |
print('Number of users: ', user.shape[0])
user.head()
Number of users: 278858
| | User-ID | Location | Age |
|---|---|---|---|
| 0 | 1 | nyc, new york, usa | NaN |
| 1 | 2 | stockton, california, usa | 18.0 |
| 2 | 3 | moscow, yukon territory, russia | NaN |
| 3 | 4 | porto, v.n.gaia, portugal | 17.0 |
| 4 | 5 | farnborough, hants, united kingdom | NaN |
book.rename(columns={'Book-Title':'title',"Book-Author":"author","Year-Of-Publication":'year',"Publisher":"publisher","Image-URL-S":"image_s","Image-URL-M":"image_m","Image-URL-L":"image_links"},inplace=True)
book.drop(columns=['image_s','image_m'], inplace=True)
book.reset_index(drop=True,inplace=True)
book.head()
| | ISBN | title | author | year | publisher | image_links |
|---|---|---|---|---|---|---|
| 0 | 0195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | http://images.amazon.com/images/P/0195153448.0... |
| 1 | 0002005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... |
| 2 | 0060973129 | Decision in Normandy | Carlo D'Este | 1991 | HarperPerennial | http://images.amazon.com/images/P/0060973129.0... |
| 3 | 0374157065 | Flu: The Story of the Great Influenza Pandemic... | Gina Bari Kolata | 1999 | Farrar Straus Giroux | http://images.amazon.com/images/P/0374157065.0... |
| 4 | 0393045218 | The Mummies of Urumchi | E. J. W. Barber | 1999 | W. W. Norton & Company | http://images.amazon.com/images/P/0393045218.0... |
Book ISBNs consist of 10 characters (digits, possibly ending in 'x').
For example: "185326119x" or "0393045218"
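As an aside, valid ISBN-10 codes also carry a checksum. The cleaning below only looks at length and the trailing 'x', but for reference, here is a minimal sketch of the checksum test (the helper name is ours, not part of the project):

```python
def isbn10_checksum_ok(isbn):
    """Return True if a 10-character ISBN passes the mod-11 checksum.

    Positions are weighted 1..10; a trailing 'x' stands for the value 10.
    """
    if len(isbn) != 10:
        return False
    total = 0
    for pos, ch in enumerate(isbn, start=1):
        if ch.isdigit():
            total += pos * int(ch)
        elif pos == 10 and ch in 'xX':
            total += 10 * 10
        else:
            return False
    return total % 11 == 0

print(isbn10_checksum_ok("0393045218"))  # True
print(isbn10_checksum_ok("185326119x"))  # True
```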
for i in range(len(book)):
    if len(str(book.iloc[i,0])) != 10:
        print(book.iloc[i,:])
        print("_____________________________________________")
ISBN                                              0486404242\t
title          War in Kind: And Other Poems (Dover Thrift Edi...
author                                               Stephen Crane
year                                                          1998
publisher                                       Dover Publications
image_links    http://images.amazon.com/images/P/0486404242.0...
Name: 111808, dtype: object
_____________________________________________
ISBN                                                 3518365479<90
title                         Suhrkamp Taschenbücher, Nr.47, Frost
author                                             Thomas Bernhard
year                                                          1972
publisher                                                 Suhrkamp
image_links    http://images.amazon.com/images/P/3518365479.0...
Name: 171206, dtype: object
_____________________________________________
ISBN                                                  3442248027 3
title                                  Diamond Age. Die Grenzwelt.
author                                             Neal Stephenson
year                                                          2000
publisher                                                 Goldmann
image_links    http://images.amazon.com/images/P/3442248027.0...
Name: 251424, dtype: object
_____________________________________________
ISBN                                                  0385722206 0
title          Balzac and the Little Chinese Seamstress : A N...
author                                                   DAI SIJIE
year                                                          2002
publisher                                                   Anchor
image_links    http://images.amazon.com/images/P/0385722206.0...
Name: 251649, dtype: object
_____________________________________________
We notice that some ISBNs in the Books dataset contain errors:
# Fix the erroneous ISBNs:
book.iloc[251649,0]='0385722206'
book.iloc[251424,0]='3442248027'
book.iloc[171206,0]='3518365479'
book.iloc[111808,0]='0486404242'
# Normalize ISBNs ending in 'X' to lowercase 'x'
book['ISBN'] = book['ISBN'].str.replace('X', 'x')
We also notice duplicates in the Books dataset:
# Examples of duplicated books
counts=book['ISBN'].value_counts()
for i, index in enumerate(counts[counts>1].index):
    print(book[book.iloc[:,0]==index].iloc[:,[1,2]])
    print("________________________________________________")
    if i > 5:
        break
title author
21723 The Jungle Book (Wordsworth Collection) Rudyard Kipling
116461 The Jungle Book (Wordsworth Collection) Rudyard Kipling
________________________________________________
title author
3591 The Professor and the Madman: A Tale of Murder... Simon Winchester
168738 The Professor and the Madman: A Tale of Murder... Simon Winchester
________________________________________________
title author
23122 The Greenlanders Jane Smiley
162302 The Greenlanders Jane Smiley
________________________________________________
title author
21741 Alice In Wonderland Great Illustrated CL (Alic... Lewis Carroll
200612 Alice In Wonderland Great Illustrated CL (Alic... Lewis Carroll
________________________________________________
title author
112736 The Old Patagonian Express : By Train Through ... Paul Theroux
140413 The Old Patagonian Express : By Train Through ... Paul Theroux
________________________________________________
title author
23127 The Loop: A Novel Nicholas Evans
50993 The Loop: A Novel Nicholas Evans
________________________________________________
title author
372 Lake News Barbara Delinsky
209711 Lake News Barbara Delinsky
________________________________________________
# Remove the duplicates
print("Initial size: ",book.shape[0])
removed_books += book.shape[0]
book = book.drop_duplicates(subset=['ISBN'])
book.reset_index(drop=True,inplace=True)
print("Size after removing duplicates: ",book.shape[0])
removed_books -= book.shape[0]
print("Total removed books from the original dataset: ",removed_books)
Initial size:  271360
Size after removing duplicates:  271044
Total removed books from the original dataset:  316
We notice missing values in the author and publisher attributes:
book.isnull().sum()
ISBN           0
title          0
author         1
year           0
publisher      2
image_links    3
dtype: int64
# Display the books with a missing 'author'
book.loc[book["author"].isnull(),:]
| | ISBN | title | author | year | publisher | image_links |
|---|---|---|---|---|---|---|
| 187534 | 9627982032 | The Credit Suisse Guide to Managing Your Perso... | NaN | 1995 | Edinburgh Financial Publishing | http://images.amazon.com/images/P/9627982032.0... |
# Display the books with a missing 'publisher'
book.loc[book['publisher'].isnull(),:]
| | ISBN | title | author | year | publisher | image_links |
|---|---|---|---|---|---|---|
| 128811 | 193169656x | Tyrant Moon | Elaine Corvidae | 2002 | NaN | http://images.amazon.com/images/P/193169656X.0... |
| 128958 | 1931696993 | Finders Keepers | Linnea Sinclair | 2001 | NaN | http://images.amazon.com/images/P/1931696993.0... |
# Manually fill in the missing author
book.loc[book.ISBN == '9627982032', 'author'] = 'Larissa Anne Downes'
# Manually fill in the missing publishers
book.loc[book.ISBN == '193169656x', 'publisher'] = 'NovelBooks INC'
book.loc[book.ISBN == '1931696993', 'publisher'] = 'CreateSpace Independent Publishing Platform'
We notice several inconsistencies in the year attribute:
book['year'].unique()
array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
'2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
'1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
'0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
'1963', '1956', '1970', '1985', '1978', '1973', '1980', '1979',
'1975', '1969', '1961', '1965', '1939', '1958', '1950', '1953',
'1966', '1971', '1959', '1972', '1955', '1957', '1945', '1960',
'1967', '1932', '1924', '1964', '2012', '1911', '1927', '1948',
'1962', '2006', '1952', '1940', '1951', '1931', '1954', '2005',
'1930', '1941', '1944', 'DK Publishing Inc', '1943', '1938',
'1900', '1942', '1923', '1920', '1933', 'Gallimard', '1909',
'1946', '2008', '1378', '2030', '1936', '1947', '2011', '2020',
'1919', '1949', '1922', '1897', '2024', '1376', '1926', '2037'],
dtype=object)
We can already see that publisher names have been assigned to the year column. Looking more closely..
book.loc[(book['year'] == 'DK Publishing Inc') | (book['year'] =="Gallimard") ,:]
| | ISBN | title | author | year | publisher | image_links |
|---|---|---|---|---|---|---|
| 209344 | 078946697x | DK Readers: Creating the X-Men, How It All Beg... | 2000 | DK Publishing Inc | http://images.amazon.com/images/P/078946697X.0... | NaN |
| 220513 | 2070426769 | Peuple du ciel, suivi de 'Les Bergers\";Jean-M... | 2003 | Gallimard | http://images.amazon.com/images/P/2070426769.0... | NaN |
| 221454 | 0789466953 | DK Readers: Creating the X-Men, How Comic Book... | 2000 | DK Publishing Inc | http://images.amazon.com/images/P/0789466953.0... | NaN |
.. we observe that the author is missing, so all columns after it are shifted by one. We need to fix this:
book[['year','publisher','image_links']] = book[['year','publisher', 'image_links']].mask((book['year'] == 'Gallimard') | (book['year'] == 'DK Publishing Inc'), book[['author','year','publisher']].values)
Here we simply applied a mask to the affected rows in order to shift the three misaligned columns back into place.
# Manually add the missing authors
book.loc[209344 ,'author'] = 'Michael Teitelbaum'
book.loc[221454 ,'author'] = "James Buckley"
book.loc[220513 ,'author'] = "Jean-Marie Gustave Le Clézio"
The Books dataset now contains no missing values:
book.isnull().sum()
ISBN           0
title          0
author         0
year           0
publisher      0
image_links    0
dtype: int64
Looking more closely at the year attribute..
book['year'] = book["year"].astype(int)
print(sorted(book.year.unique()))
[0, 1376, 1378, 1806, 1897, 1900, 1901, 1902, 1904, 1906, 1908, 1909, 1910, 1911, 1914, 1917, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2008, 2010, 2011, 2012, 2020, 2021, 2024, 2026, 2030, 2037, 2038, 2050]
We see that years after 2012 are erroneous. They must be replaced with the actual publication years.
book[book['year']>2012][["ISBN","year"]].values
array([['068160204x', 2020],
['0671746103', 2030],
['0671791990', 2030],
['0870449842', 2030],
['0140301690', 2050],
['068107468x', 2020],
['0140201092', 2050],
['0394701658', 2038],
['3442436893', 2026],
['0590085417', 2021],
['0870446924', 2030],
['0671266500', 2030],
['068471941x', 2020],
['0684718022', 2030],
['0380000059', 2024],
['068471809x', 2037],
['0671740989', 2030]], dtype=object)
# Correct the years > 2012 (correct values looked up manually)
correct_years=[1997,1991,1992,1999,1970,1998,1950,1959,1996,1974,1987,1961,1920,1971,1925,1937,1991]
for i, j in zip(book[book['year']>2012][["ISBN","year"]].values, correct_years):
    print(i, "---> correct year=", j)
['068160204x' 2020] ---> correct year= 1997
['0671746103' 2030] ---> correct year= 1991
['0671791990' 2030] ---> correct year= 1992
['0870449842' 2030] ---> correct year= 1999
['0140301690' 2050] ---> correct year= 1970
['068107468x' 2020] ---> correct year= 1998
['0140201092' 2050] ---> correct year= 1950
['0394701658' 2038] ---> correct year= 1959
['3442436893' 2026] ---> correct year= 1996
['0590085417' 2021] ---> correct year= 1974
['0870446924' 2030] ---> correct year= 1987
['0671266500' 2030] ---> correct year= 1961
['068471941x' 2020] ---> correct year= 1920
['0684718022' 2030] ---> correct year= 1971
['0380000059' 2024] ---> correct year= 1925
['068471809x' 2037] ---> correct year= 1937
['0671740989' 2030] ---> correct year= 1991
book.loc[book['year']>2012, 'year'] = correct_years
# Alternative solution with a for loop
# for i,j in zip(book[book['year']>2012]["ISBN"].index,correct_years):
#     book.loc[i,'year']=j
book.loc[book['year']>2012, 'year'].values
array([], dtype=int32)
The years above 2012 have been corrected. Now for the years below 1806:
book[(book['year']<1806) & (book['year']>0)][["ISBN","year"]]
| | ISBN | year |
|---|---|---|
| 227296 | 9643112136 | 1378 |
| 253457 | 964442011x | 1376 |
book.loc[227296,"year"]=2010
book.loc[253457,"year"]=2011
# year ==0
book[book['year']==0]
| | ISBN | title | author | year | publisher | image_links |
|---|---|---|---|---|---|---|
| 176 | 3150000335 | Kabale Und Liebe | Schiller | 0 | Philipp Reclam, Jun Verlag GmbH | http://images.amazon.com/images/P/3150000335.0... |
| 188 | 342311360x | Die Liebe in Den Zelten | Gabriel Garcia Marquez | 0 | Deutscher Taschenbuch Verlag (DTV) | http://images.amazon.com/images/P/342311360X.0... |
| 288 | 0571197639 | Poisonwood Bible Edition Uk | Barbara Kingsolver | 0 | Faber Faber Inc | http://images.amazon.com/images/P/0571197639.0... |
| 351 | 3596214629 | Herr Der Fliegen (Fiction, Poetry and Drama) | Golding | 0 | Fischer Taschenbuch Verlag GmbH | http://images.amazon.com/images/P/3596214629.0... |
| 542 | 8845229041 | Biblioteca Universale Rizzoli: Sulla Sponda De... | P Coelho | 0 | Fabbri - RCS Libri | http://images.amazon.com/images/P/8845229041.0... |
| ... | ... | ... | ... | ... | ... | ... |
| 270478 | 014029953x | Foe (Essential.penguin S.) | J.M. Coetzee | 0 | Penguin Books Ltd | http://images.amazon.com/images/P/014029953X.0... |
| 270597 | 0340571187 | Postmens House | Maggie Hemingway | 0 | Trafalgar Square | http://images.amazon.com/images/P/0340571187.0... |
| 270778 | 8427201079 | El Misterio De Sittaford | Agatha Christie | 0 | Editorial Molino | http://images.amazon.com/images/P/8427201079.0... |
| 270866 | 0887781721 | Tom Penny | Tony German | 0 | P. Martin Associates | http://images.amazon.com/images/P/0887781721.0... |
| 270880 | 3150013763 | Der Hofmeister | Jakob Lenz | 0 | Philipp Reclam, Jun Verlag GmbH | http://images.amazon.com/images/P/3150013763.0... |
4611 rows × 6 columns
Number of books published per year (1997-2003), followed by the top 10 authors, publishers, and titles by number of books.
book_year_count=book.groupby('year',as_index=False).count()[['year','ISBN']]
book_year_count=book_year_count[(book_year_count["year"]>1996) & (book_year_count["year"]<2004)]
book_year_count
| | year | ISBN |
|---|---|---|
| 92 | 1997 | 14873 |
| 93 | 1998 | 15752 |
| 94 | 1999 | 17411 |
| 95 | 2000 | 17214 |
| 96 | 2001 | 17337 |
| 97 | 2002 | 17614 |
| 98 | 2003 | 14330 |
sns.barplot(data=book_year_count,x="year",y="ISBN",palette=palette)
<AxesSubplot: xlabel='year', ylabel='ISBN'>
def plot_count(df, column_name):
    plot_column = df[column_name].value_counts()[0:10].reset_index()
    plot_column.columns = [column_name, 'Count']
    plt.figure(figsize=(10, 6))
    ax = sns.barplot(x=column_name, y='Count', data=plot_column, palette=palette)
    ax.set_title("Top 10 " + column_name)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
    plt.show()
plot_count(book,"author")
plot_count(book,"publisher")
plot_count(book,"title")
A function that looks up an ISBN both via the Google Books API and in our Books table:
import urllib.request
import json
import textwrap
from dateutil.parser import parse
import datetime
def search_isbn(isbn):
    exist_in_book = isbn in book.ISBN.unique()
    if exist_in_book:
        info_book = book[book.ISBN == isbn]
    base_api_link = "https://www.googleapis.com/books/v1/volumes?q=isbn:"
    with urllib.request.urlopen(base_api_link + isbn) as f:
        text = f.read()
    decoded_text = text.decode("utf-8")
    obj = json.loads(decoded_text)  # deserializes decoded_text to a Python object
    if "items" in obj.keys():
        volume_info = obj["items"][0]
    elif exist_in_book:
        # Google Books does not know this ISBN: fall back to our Books table
        return {"ISBN": isbn, 'title': info_book.title.values[0],
                'author': info_book.author.values[0], 'publisher': info_book.publisher.values[0],
                'date': info_book.year.values[0], "nombre de pages": None,
                'description': None, 'image_links': info_book.image_links.values[0]}
    else:
        return None
    # title
    title = volume_info["volumeInfo"].get("title", None)
    if title is None and exist_in_book:
        title = info_book.title.values[0]
    # author
    authors = volume_info["volumeInfo"].get('authors', None)
    author = authors[0] if authors else None
    if author is None and exist_in_book:
        author = info_book.author.values[0]
    # publisher
    publisher = volume_info["volumeInfo"].get('publisher', None)
    if publisher is None and exist_in_book:
        publisher = info_book.publisher.values[0]
    # extract the year if a publication date is found
    date = volume_info["volumeInfo"].get('publishedDate', None)
    if isinstance(date, datetime.datetime):
        date = date.strftime("%Y-%m-%d %H:%M:%S")
    else:
        date = str(date)
    if date.isdigit():
        year = int(date)
    else:
        try:
            date = parse(date)
            year = date.year
        except ValueError:
            # the string is neither a number nor a valid date
            year = None
    if year is None and exist_in_book:
        year = info_book.year.values[0]
    # categories
    categories = volume_info["volumeInfo"].get("categories", None)
    # page count
    nbr_pages = volume_info["volumeInfo"].get("pageCount", None)
    # image link: prefer our own table when the book is in it
    if exist_in_book:
        image_links = info_book.image_links.values[0]
    else:
        image_links = volume_info["volumeInfo"].get("imageLinks", None)
        if isinstance(image_links, dict):
            image_links = image_links.get('thumbnail', None)
    # description
    description = volume_info["volumeInfo"].get("description", None)
    return {"ISBN": isbn, 'title': title, 'author': author, 'publisher': publisher,
            'date': year, "nombre de pages": nbr_pages, 'description': description,
            'image_links': image_links}
user.rename(columns={"User-ID":'user_id','Location':'location','Age':'age'},inplace=True)
user.columns
Index(['user_id', 'location', 'age'], dtype='object')
user.isna().sum()
user_id          0
location         0
age         110762
dtype: int64
user.age.describe()
count    168096.000000
mean         34.751434
std          14.428097
min           0.000000
25%          24.000000
50%          32.000000
75%          44.000000
max         244.000000
Name: age, dtype: float64
user.age=user.age.astype(float)
sns.histplot(user.age, bins=20, kde=True)
plt.title("Distribution de densité de l'âge avant imputation")
plt.xlabel("Âge")
plt.ylabel("Densité")
Text(0, 0.5, 'Densité')
# Replace implausible ages (<10 or >80) and missing ages with the mean of plausible ages
mean = round(user[(user.age > 10) & (user.age < 80)]['age'].mean())
user.loc[(user.age < 10) | (user.age > 80), 'age'] = mean
user['age'] = user['age'].fillna(mean)
sns.histplot(user.age, bins=20, kde=True)
plt.title("Distribution de densité de l'âge apres imputation")
plt.xlabel("Âge")
plt.ylabel("Densité")
Text(0, 0.5, 'Densité')
user.location
0 nyc, new york, usa
1 stockton, california, usa
2 moscow, yukon territory, russia
3 porto, v.n.gaia, portugal
4 farnborough, hants, united kingdom
...
278853 portland, oregon, usa
278854 tacoma, washington, united kingdom
278855 brampton, ontario, canada
278856 knoxville, tennessee, usa
278857 dublin, n/a, ireland
Name: location, Length: 278858, dtype: object
user.location.unique()
array(['nyc, new york, usa', 'stockton, california, usa',
'moscow, yukon territory, russia', ...,
'sergnano, lombardia, italy', 'stranraer, n/a, united kingdom',
'tacoma, washington, united kingdom'], dtype=object)
user.location.nunique()
57339
# Extract the country as the last component of the location string
user['country'] = user.location.str.extract(r',+\s?(\w*\s?\w*)"*$')
user.head(50)
| | user_id | location | age | country |
|---|---|---|---|---|
| 0 | 1 | nyc, new york, usa | 35.0 | usa |
| 1 | 2 | stockton, california, usa | 18.0 | usa |
| 2 | 3 | moscow, yukon territory, russia | 35.0 | russia |
| 3 | 4 | porto, v.n.gaia, portugal | 17.0 | portugal |
| 4 | 5 | farnborough, hants, united kingdom | 35.0 | united kingdom |
| 5 | 6 | santa monica, california, usa | 61.0 | usa |
| 6 | 7 | washington, dc, usa | 35.0 | usa |
| 7 | 8 | timmins, ontario, canada | 35.0 | canada |
| 8 | 9 | germantown, tennessee, usa | 35.0 | usa |
| 9 | 10 | albacete, wisconsin, spain | 26.0 | spain |
| 10 | 11 | melbourne, victoria, australia | 14.0 | australia |
| 11 | 12 | fort bragg, california, usa | 35.0 | usa |
| 12 | 13 | barcelona, barcelona, spain | 26.0 | spain |
| 13 | 14 | mediapolis, iowa, usa | 35.0 | usa |
| 14 | 15 | calgary, alberta, canada | 35.0 | canada |
| 15 | 16 | albuquerque, new mexico, usa | 35.0 | usa |
| 16 | 17 | chesapeake, virginia, usa | 35.0 | usa |
| 17 | 18 | rio de janeiro, rio de janeiro, brazil | 25.0 | brazil |
| 18 | 19 | weston, , | 14.0 | |
| 19 | 20 | langhorne, pennsylvania, usa | 19.0 | usa |
| 20 | 21 | ferrol / spain, alabama, spain | 46.0 | spain |
| 21 | 22 | erfurt, thueringen, germany | 35.0 | germany |
| 22 | 23 | philadelphia, pennsylvania, usa | 35.0 | usa |
| 23 | 24 | cologne, nrw, germany | 19.0 | germany |
| 24 | 25 | oakland, california, usa | 55.0 | usa |
| 25 | 26 | bellevue, washington, usa | 35.0 | usa |
| 26 | 27 | chicago, illinois, usa | 32.0 | usa |
| 27 | 28 | freiburg, baden-wuerttemberg, germany | 24.0 | germany |
| 28 | 29 | cuernavaca, alabama, mexico | 19.0 | mexico |
| 29 | 30 | anchorage, alaska, usa | 24.0 | usa |
| 30 | 31 | shanghai, n/a, china | 20.0 | china |
| 31 | 32 | portland, oregon, usa | 35.0 | usa |
| 32 | 33 | costa mesa, california, usa | 34.0 | usa |
| 33 | 34 | london, england, united kingdom | 35.0 | united kingdom |
| 34 | 35 | grafton, wisconsin, usa | 17.0 | usa |
| 35 | 36 | montreal, quebec, canada | 24.0 | canada |
| 36 | 37 | san sebastian, n/a, spain | 23.0 | spain |
| 37 | 38 | viterbo, lazio, italy | 34.0 | italy |
| 38 | 39 | cary, north carolina, usa | 35.0 | usa |
| 39 | 40 | tonawanda, new york, usa | 32.0 | usa |
| 40 | 41 | santee, california, usa | 14.0 | usa |
| 41 | 42 | appleton, wisconsin, usa | 17.0 | usa |
| 42 | 43 | méxico, méxico city, distrito federal | 35.0 | distrito federal |
| 43 | 44 | black mountain, north carolina, usa | 51.0 | usa |
| 44 | 45 | berlin, n/a, germany | 35.0 | germany |
| 45 | 46 | heidelberg, baden-wuerttemberg, germany | 31.0 | germany |
| 46 | 47 | vicenza, veneto, italy | 21.0 | italy |
| 47 | 48 | chicago, illinois, usa | 35.0 | usa |
| 48 | 49 | rome, rome, italy | 35.0 | italy |
| 49 | 50 | london, england, united kingdom | 17.0 | united kingdom |
user.country.nunique()
529
user.isnull().sum()
user_id       0
location      0
age           0
country     368
dtype: int64
user['country']=user['country'].astype('str')
user['country'].replace(['','01776','02458','19104','23232','30064','85021','87510','alachua','america','austria','autralia','cananda','geermany','italia','united kindgonm','united sates','united staes','united state','united states','us'],
['other','usa','usa','usa','usa','usa','usa','usa','usa','usa','australia','australia','canada','germany','italy','united kingdom','usa','usa','usa','usa','usa'],inplace=True)
user.location
0 nyc, new york, usa
1 stockton, california, usa
2 moscow, yukon territory, russia
3 porto, v.n.gaia, portugal
4 farnborough, hants, united kingdom
...
278853 portland, oregon, usa
278854 tacoma, washington, united kingdom
278855 brampton, ontario, canada
278856 knoxville, tennessee, usa
278857 dublin, n/a, ireland
Name: location, Length: 278858, dtype: object
def valide(string):
    if string == ' ' or string == '' or string == 'n/a' or string == ',':
        return False
    return True
list_location=user.location.str.split(", ")
list_location
0 [nyc, new york, usa]
1 [stockton, california, usa]
2 [moscow, yukon territory, russia]
3 [porto, v.n.gaia, portugal]
4 [farnborough, hants, united kingdom]
...
278853 [portland, oregon, usa]
278854 [tacoma, washington, united kingdom]
278855 [brampton, ontario, canada]
278856 [knoxville, tennessee, usa]
278857 [dublin, n/a, ireland]
Name: location, Length: 278858, dtype: object
# Fix the addresses that contain 5 elements and
# replace them with a 3-element address: city, state, country
user.loc[5641,'location'] = 'other, other, mexico'
user.loc[6557,'location'] = 'schiltigheim, alsace, france'
user.loc[9695,'location'] = 'other, other, mexico'
user.loc[12512,'location'] = 'nanaimo, british columbia, canada'
user.loc[43946,'location'] = 'vaubuzin, frolois, france'
user.loc[46964,'location'] = "seaview parkway, gulf view, trinidad and tobago"
user.loc[48738,'location'] = "mainz, rhineland palatinate, germany"
user.loc[55159,'location'] = 'newport, Wales, united kingdom'
user.loc[61675,'location'] = 'bracknell, berkshire, united kingdom'
user.loc[65857,'location'] ="geneva, florida, usa"
user.loc[66143,'location'] = "vita, sicilia, italy"
user.loc[66147,'location'] ='ioulís, kea, greece'
user.loc[82756,'location'] = 'mizunami, gifu, japan'
user.loc[83738,'location'] = 'listowel, ontario, canada'
user.loc[85108,'location'] = 'baloira, pontevedra, spain'
user.loc[97080,'location'] = 'bonn, rhine westphalia, germany'
user.loc[98124,'location'] = 'king ct, green brook, usa'
user.loc[17884,'location'] = 'Mumbai, maharashtra, india'
user.loc[120092,'location'] = 'merrick, new york, united states'
user.loc[120835,'location'] = 'ndianapolis, indiana, united states'
user.loc[131301,'location'] = 'huelva, andalucia, españa'
user.loc[154405,'location'] = 'san rafael arriba, san josé, costa rica'
user.loc[155588,'location'] = 'other, dhaka, bangladesh'
user.loc[155733,'location'] = 'toronto, Ontario, canada'
user.loc[275707,'location'] = 'somers, victoria, australia'
user.loc[274054,'location'] = 'enschede, budapest, hungary'
user.loc[273290,'location'] = 'san ramon, alajuela, costa rica'
user.loc[270111,'location'] = 'bonn, köln, germany'
user.loc[266273,'location'] = 'isabela, basilan, philippines'
user.loc[264917,'location'] = 'zelezny brod, koberovy, czech republic'
user.loc[264458,'location'] = 'coimbra, oliveira do hospital, portugal'
user.loc[263333,'location'] = 'pooc occidental, tubigon, philippines'
user.loc[261650,'location'] = 'méxico, d.f., mexico'
user.loc[258217,'location'] = 'alajuela, san carlos, costa rica'
user.loc[255932,'location'] = 'san josé, san josé, costa rica'
user.loc[255786,'location'] = 'kristianstad, åhus, sweden'
user.loc[249300,'location'] = 'accra, accra, ghana'
user.loc[235579,'location'] = 'quetigny, côte d`or, france'
user.loc[230891,'location'] = 'other, other, france'
user.loc[228791,'location'] = 'mexico, d.f., mexico'
user.loc[225434,'location'] = 'lawrence, kansas, usa'
user.loc[212652,'location'] = 'other, other, usa'
user.loc[203908,'location'] = 'gold coast, queensland, australia'
user.loc[179621,'location'] = 'méxico, d.f., mexico'
user.loc[175617,'location'] = 'south sulawesi, makassar, indonesia'
user.loc[164758,'location'] = 'mexico, d.f., mexico'
user.loc[161512,'location'] = 'ota, ogun state, nigeria'
user.loc[158135,'location'] = 'durbanville, cape town, south africa'
# Fix the addresses whose element count differs from 5 and 3 and turn them into 3-element addresses
user.loc[76342,'location']='tijuana, baja california, mexico'
user.loc[122814,'location']='tubod, lanao del norte, philippines'
user.loc[258822,'location']='northern ireland, lurgan, united kingdom'
user.loc[194259,'location']='daejeon, other, south korea'
user.loc[215970,'location']='silver spring, maryland, usa'
user.loc[241302,'location']='thane, maharastra, india'
user.loc[134376,'location']='lawrenceville, other, usa'
list_location=user.location.str.split(", ")
pip install geopy
Requirement already satisfied: geopy in c:\users\sami\anaconda3\envs\dauphine\lib\site-packages (2.3.0)
Requirement already satisfied: geographiclib<3,>=1.52 in c:\users\sami\anaconda3\envs\dauphine\lib\site-packages (from geopy) (2.0)
Note: you may need to restart the kernel to use updated packages.
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="my_app")
from itertools import combinations
# Example input for the geocoding function below
string = 'moscow, yukon territory, russia'
def lat_long(string):
    location = geolocator.geocode(string, timeout=10)
    if location is None:
        # Fall back to pairs of address components until one of them geocodes
        List = list(combinations(string.split(","), 2))
        for i in range(len(List)):
            List[i] = List[i][0] + List[i][1]
            location = geolocator.geocode(List[i], timeout=10)
            if location is not None:
                break
    if location is None:
        return pd.Series({'lat': None, 'long': None})
    return pd.Series({'lat': float(location.raw['lat']), 'long': float(location.raw['lon'])})
test = user[:10].copy()  # copy to avoid SettingWithCopyWarning
test
| | user_id | location | age | country |
|---|---|---|---|---|
| 0 | 1 | nyc, new york, usa | 35.0 | usa |
| 1 | 2 | stockton, california, usa | 18.0 | usa |
| 2 | 3 | moscow, yukon territory, russia | 35.0 | russia |
| 3 | 4 | porto, v.n.gaia, portugal | 17.0 | portugal |
| 4 | 5 | farnborough, hants, united kingdom | 35.0 | united kingdom |
| 5 | 6 | santa monica, california, usa | 61.0 | usa |
| 6 | 7 | washington, dc, usa | 35.0 | usa |
| 7 | 8 | timmins, ontario, canada | 35.0 | canada |
| 8 | 9 | germantown, tennessee, usa | 35.0 | usa |
| 9 | 10 | albacete, wisconsin, spain | 26.0 | spain |
from tqdm import tqdm
tqdm.pandas()
test[['lat','long']] = test.location.progress_apply(lat_long)
100%|██████████████████████████████████████████| 10/10 [00:09<00:00, 1.08it/s]
test
| | user_id | location | age | country | lat | long |
|---|---|---|---|---|---|---|
| 0 | 1 | nyc, new york, usa | 35.0 | usa | 40.712728 | -74.006015 |
| 1 | 2 | stockton, california, usa | 18.0 | usa | 37.957702 | -121.290780 |
| 2 | 3 | moscow, yukon territory, russia | 35.0 | russia | 55.750446 | 37.617494 |
| 3 | 4 | porto, v.n.gaia, portugal | 17.0 | portugal | 41.149451 | -8.610788 |
| 4 | 5 | farnborough, hants, united kingdom | 35.0 | united kingdom | 51.291869 | -0.753984 |
| 5 | 6 | santa monica, california, usa | 61.0 | usa | 34.019470 | -118.491227 |
| 6 | 7 | washington, dc, usa | 35.0 | usa | 38.895037 | -77.036543 |
| 7 | 8 | timmins, ontario, canada | 35.0 | canada | 48.477473 | -81.330414 |
| 8 | 9 | germantown, tennessee, usa | 35.0 | usa | 35.086758 | -89.810086 |
| 9 | 10 | albacete, wisconsin, spain | 26.0 | spain | 38.995092 | -1.855915 |
#drop location column
user.drop('location',axis=1,inplace=True)
print(rating.columns)
rating.rename(columns={"User-ID":"id","Book-Rating":"rating"},inplace=True)
print('\nRename to --->\n', rating.columns)
Index(['User-ID', 'ISBN', 'Book-Rating'], dtype='object')

Rename to --->
 Index(['id', 'ISBN', 'rating'], dtype='object')
We replace 'X' with 'x' in the ISBNs of the ratings table as well..
rating['ISBN']=rating['ISBN'].str.replace("X","x")
rating.head()
| | id | ISBN | rating |
|---|---|---|---|
| 0 | 276725 | 034545104x | 0 |
| 1 | 276726 | 0155061224 | 5 |
| 2 | 276727 | 0446520802 | 0 |
| 3 | 276729 | 052165615x | 3 |
| 4 | 276729 | 0521795028 | 6 |
# Value counts of the ISBN lengths
long = rating.ISBN.apply(len).value_counts()
x = list(long.to_frame().index)
y = long.to_frame().values
y = list(y.reshape(-1))
# Drop the dominant length-10 bar so the other lengths are visible
if 10 in x:
    y.pop(x.index(10))
    x.pop(x.index(10))
plt.bar(x, y)
<BarContainer object of 6 artists>
list_to_drop = []
# After checking on Google, 9-character ISBNs that are neither fully numeric
# nor ending in 'x' do not correspond to real books
for i in range(len(rating)):
    if len(rating.iloc[i,1]) == 9:
        if rating.iloc[i,1].isnumeric() == False and rating.iloc[i,1][-1] != 'x':
            list_to_drop.append(i)
# 9-character ISBNs ending in 'x': try prepending a '0'
for i in range(len(rating)):
    if len(rating.iloc[i,1]) == 9:
        if rating.iloc[i,1].isnumeric() == False and rating.iloc[i,1][-1] == 'x':
            if len(book[book['ISBN'] == '0'+rating.iloc[i,1]]) != 0:  # there is a match in the Books table
                rating.iloc[i,1] = '0'+rating.iloc[i,1]
            else:
                list_to_drop.append(i)
# Fully numeric 9-character ISBNs: try prepending a '0' or appending an 'x'
for i in range(len(rating)):
    if len(rating.iloc[i,1]) == 9:
        if rating.iloc[i,1].isnumeric() == True:
            if len(book[(book['ISBN'] == '0'+rating.iloc[i,1]) | (book['ISBN'] == rating.iloc[i,1]+'x')]) != 0:
                if len(book[book['ISBN'] == '0'+rating.iloc[i,1]]) != 0:
                    rating.iloc[i,1] = '0'+rating.iloc[i,1]
                elif len(book[book['ISBN'] == rating.iloc[i,1]+'x']) != 0:
                    rating.iloc[i,1] = rating.iloc[i,1]+'x'
            else:
                list_to_drop.append(i)
rating = rating.drop(index=list_to_drop)
# Look for (ISBN, id) pairs rated twice: duplicate ratings, sometimes with different values
nun = rating.groupby(['ISBN','id']).agg(['count','nunique'])
nun.loc[nun[('rating','count')] == 2]
| ISBN | id | count | nunique |
|---|---|---|---|
| 0060177209 | 194027 | 2 | 1 |
| 006101351x | 11676 | 2 | 2 |
| 0064400085 | 127429 | 2 | 1 |
| 0140440151 | 273718 | 2 | 1 |
| 034541389x | 76767 | 2 | 2 |
| 0373244487 | 89098 | 2 | 2 |
| 038549081x | 83596 | 2 | 1 |
| 0385504209 | 11676 | 2 | 2 |
| 039470388x | 117539 | 2 | 1 |
| 0399501487 | 52584 | 2 | 1 |
| 0425031748 | 102967 | 2 | 1 |
| 044021145x | 202277 | 2 | 1 |
| 0446611778 | 11676 | 2 | 2 |
| 0452261368 | 245827 | 2 | 2 |
| 051512608x | 26407 | 2 | 1 |
| 051513287x | 190925 | 2 | 1 |
| 051513306x | 180549 | 2 | 1 |
| 051513628x | 11676 | 2 | 2 |
| 055335387x | 11676 | 2 | 2 |
| 055358099x | 183145 | 2 | 2 |
| 059045661x | 198711 | 2 | 1 |
| 0671028375 | 177861 | 2 | 2 |
| 067976402x | 127914 | 2 | 1 |
| 0843947624 | 26535 | 2 | 1 |
| 086625370x | 245827 | 2 | 1 |
| 086625501x | 245827 | 2 | 1 |
| 088396631x | 143175 | 2 | 1 |
| | 166596 | 2 | 1 |
| | 277879 | 2 | 2 |
| 325723158x | 125224 | 2 | 2 |
| 342660700x | 11676 | 2 | 2 |
| 842046435x | 11676 | 2 | 2 |
| 881787017x | 83879 | 2 | 2 |
# Remove duplicates
rating = rating.drop_duplicates(subset=['id', 'ISBN'])
rating['ISBN'].apply(len)
0 10
1 10
2 10
3 10
4 10
..
1149775 10
1149776 10
1149777 10
1149778 10
1149779 11
Name: ISBN, Length: 1145621, dtype: int64
long = pd.DataFrame(rating['ISBN'].apply(len).reset_index())
long.rename(columns={'ISBN':'ISBN length'}, inplace=True)
plot_count(long[long['ISBN length']!=10], "ISBN length")
# Hence, segregating the implicit and explicit rating datasets
rating_explicit = rating[rating['rating'] != 0]
rating['rating_implicit'] = rating['rating'].apply(lambda x: 0 if (x <6) else 1)
rating.rating_implicit.value_counts()
0    783784
1    361837
Name: rating_implicit, dtype: int64
rating.to_pickle('pickle/rating.pkl')
rating_explicit.to_pickle('pickle/rating_explicit.pkl')
rating_explicit.shape
(431903, 3)
From now on, we will work with the explicit ratings.
rating = pd.read_pickle('pickle/rating_explicit.pkl')
removed_ratings = 0
while True:
    ### Delete users who give the same rating to all the books they have reviewed.
    # We do not remove users who gave only a single explicit rating.
    # Group the data frame by id and count the number of unique ratings
    rating_counts = rating.groupby('id')['rating'].agg(['count', 'nunique'])
    # Keep only users who gave the same rating to all (more than one) of their books
    same_rating_users = rating_counts[(rating_counts['nunique'] == 1) & (rating_counts['count'] > 1)]
    # Select the rows of the original DataFrame that correspond to these users
    to_remove_1 = rating[rating['id'].isin(same_rating_users.index)]

    ### Delete users who have rated fewer than 8 books.
    # Group the DataFrame by user ID and count the number of books rated by each user
    user_counts = rating.groupby('id')['ISBN'].count()
    less_than_8_books = user_counts[user_counts < 8]
    to_remove_2 = rating[rating['id'].isin(less_than_8_books.index)]

    ### Delete books that are rated by fewer than 8 users.
    # Group the DataFrame by ISBN and count the number of users who rated each book
    book_counts = rating.groupby('ISBN')['id'].count()
    less_than_8_users = book_counts[book_counts < 8]
    to_remove_3 = rating[rating['ISBN'].isin(less_than_8_users.index)]

    ### Remove the rows collected in to_remove_1, to_remove_2 and to_remove_3
    merged_df = pd.concat([to_remove_1, to_remove_2], axis=0)
    merged_df = pd.concat([merged_df, to_remove_3], axis=0)
    # Concatenating rating with the rows to remove and dropping every duplicated
    # row keeps exactly the rows of rating that are absent from merged_df
    rating_cleaned = pd.concat([rating, merged_df, merged_df]).drop_duplicates(keep=False)

    if len(rating) == len(rating_cleaned):
        break
    else:
        removed_ratings += len(rating) - len(rating_cleaned)
        rating = rating_cleaned
print('Number of rows deleted: ', removed_ratings)
Number of rows deleted: 369495
import urllib.request
from PIL import Image
from io import BytesIO

url = "https://images.amazon.com/images/P/0439139597.01.MZZZZZZZ.jpg"

def plot_image(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
    }
    req = urllib.request.Request(url, headers=headers)
    img = None  # stays None if the download fails
    try:
        response = urllib.request.urlopen(req)
        if response.status == 200:
            img_data = response.read()
            img = Image.open(BytesIO(img_data))
        elif response.status == 304:
            print("Image not modified")
        else:
            print(f"Unexpected status code: {response.status}")
    except urllib.error.HTTPError as e:
        print(f"HTTP error: {e.code}")
    except urllib.error.URLError as e:
        print(f"URL error: {e.reason}")
    return img
search_isbn("0345413873")
{'ISBN': '0345413873',
'title': 'Monster',
'author': 'Jonathan Kellerman',
'publisher': 'Random House Digital, Inc.',
'date': 2000,
'nombre de pages': 420,
'description': "Dr. Alex Delaware and Detective Milo Sturgis must make sense of a nonfunctional psychotic's ramblings that seem to predict the horrific killings happening on the streets of Los Angeles.",
'image_links': 'http://images.amazon.com/images/P/0345413873.01.LZZZZZZZ.jpg'}
import requests
from PIL import Image
def display_popular_books(model, df=rating_explicit, n=5):
    recommendation = []
    try:
        recommended = model()  # model can be a callable that returns a DataFrame...
    except TypeError:
        recommended = model    # ...or directly a precomputed DataFrame
    for i in range(len(recommended)):
        d = search_isbn(recommended.loc[i, "ISBN"])
        d['Score'] = recommended.loc[i, "Score"]
        recommendation.append(d)
    recommendation = pd.DataFrame(recommendation)
    display(recommendation)
    fig, ax = plt.subplots(1, n, figsize=(17, 5))
    fig.suptitle("MOST POPULAR {} BOOKS".format(n), fontsize=40, color=palette[0])
    for i in range(len(recommendation)):
        url = recommendation["image_links"][i]
        img = plot_image(url)
        ax[i].imshow(img)
        ax[i].axis("off")
        ax[i].set_title("RATING: {} ".format(round(recommendation.loc[i,'Score'],4)), y=-0.20, color=palette[1], fontsize=10)
    plt.show()
The popularity-based recommendation system uses current trends to suggest items. For example, if there is a book that is typically purchased by every new user, it is likely that this book will be recommended to the user who just signed up.
The formula for weighting books is as follows:
$$WR = \frac{v}{v+m}\,R + \frac{m}{v+m}\,C$$
where v is the number of ratings for the book; m is the minimum number of ratings required to appear in the ranking; R is the book's average rating; and C is the mean rating across all books.
Thus, we find the values of v, m, R and C.
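As a quick sanity check with made-up numbers: a book with v = 20 ratings averaging R = 9, against a cutoff m = 10 and a global mean C = 7, gets a score pulled toward the global mean:

```python
# Worked example of the weighted rating with made-up values
v, m_ex, R_ex, C_ex = 20, 10, 9.0, 7.0
WR = v / (v + m_ex) * R_ex + m_ex / (v + m_ex) * C_ex
print(WR)  # 8.33...: between the book's own mean (9) and the global mean (7)
```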
rating_explicit = rating
# Create column Rating average
rating_explicit['Avg_Rating']=rating_explicit.groupby('ISBN')['rating'].transform('mean')
# Create column Rating sum
rating_explicit['nbr_ratings_per_book']=rating_explicit.groupby('ISBN')['rating'].transform('count')
C= rating_explicit['Avg_Rating'].mean()
m= rating_explicit['nbr_ratings_per_book'].quantile(0.90)
def weighted_rating(x, m=m, C=C):
    v = x['nbr_ratings_per_book']
    R = x['Avg_Rating']
    return (v/(v+m) * R) + (m/(m+v) * C)

def mean(x):
    R = x['Avg_Rating']
    return R
# hyperparameters
nbr_recommendation=5
estimateur=weighted_rating
rating_explicit.head()
| | id | ISBN | rating | Avg_Rating | nbr_ratings_per_book |
|---|---|---|---|---|---|
| 911 | 277157 | 0312979517 | 5 | 7.000000 | 10 |
| 913 | 277157 | 0316154059 | 5 | 7.785714 | 14 |
| 917 | 277157 | 0345452550 | 7 | 7.764706 | 17 |
| 928 | 277157 | 0399146504 | 7 | 7.500000 | 8 |
| 936 | 277157 | 0399148639 | 6 | 7.944444 | 18 |
We used the 90th percentile as the selection criterion: to appear in the ranking, a book must have more ratings than 90% of the books in the list. There are 38,570 books meeting this criterion, and we compute our metric for each of them by defining a weighted_rating() function and storing the result in a new "Score" column.
# Popularity-based recommender
def simple_recommendation_system(nbr_recommendation=nbr_recommendation, ratings=rating_explicit, estimateur=weighted_rating, m=m):
    book_frequent = ratings.loc[ratings['nbr_ratings_per_book'] >= m].copy()
    book_frequent['Score'] = book_frequent.apply(estimateur, axis=1)
    book_frequent = book_frequent.sort_values('Score', ascending=False).drop_duplicates('ISBN').iloc[:nbr_recommendation,:][['ISBN','Score']]
    return book_frequent.reset_index(drop=True)
display_popular_books(simple_recommendation_system)
| | ISBN | title | author | publisher | date | nombre de pages | description | image_links | Score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0439139597 | Harry Potter | J. K. Rowling | Scholastic | 2009 | NaN | Collects the complete series that relates the ... | http://images.amazon.com/images/P/0439139597.0... | 8.666633 |
| 1 | 0439136350 | Harry Potter | J. K. Rowling | Scholastic | 1999 | 435.0 | During his third year at Hogwarts School for w... | http://images.amazon.com/images/P/0439136350.0... | 8.646057 |
| 2 | 043935806x | Harry Potter and the Order of the Phoenix | J. K. Rowling | Scholastic | 2003 | 1232.0 | Collects the complete series that relates the ... | http://images.amazon.com/images/P/043935806X.0... | 8.629424 |
| 3 | 059035342x | Harry Potter and the Sorcerer's Stone | J. K. Rowling | Arthur A. Levine Books | 1998 | 324.0 | Rescued from the outrageous neglect of his aun... | http://images.amazon.com/images/P/059035342X.0... | 8.618802 |
| 4 | 0446310786 | To Kill a Mockingbird | Harper Lee | Grand Central Publishing | 1988 | 384.0 | The unforgettable novel of a childhood in a sl... | http://images.amazon.com/images/P/0446310786.0... | 8.601420 |
This recommendation system can be used for a brand-new user, since it does not depend on the user's own history.
user_item = rating_explicit.pivot(index='id', columns='ISBN', values='rating')
user_item.fillna(0, inplace = True)
user_item = user_item.astype(np.int32)
#checking first few rows
user_item.head(5)
| ISBN | 000649840x | 0007154615 | 0020198906 | 0020199600 | 0020427859 | 0020442009 | 0020442203 | 0020442300 | 0020442602 | 002542730x | ... | 1857989546 | 1861976127 | 1878424114 | 1878424319 | 1880418568 | 1881273156 | 1885171080 | 1896860982 | 193156146x | 1931561648 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||||||||||||||
| 243 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 254 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 507 | 0 | 0 | 0 | 0 | 9 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 638 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 805 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 3488 columns
n_users = user_item.shape[0] #considering only those users who gave explicit ratings
n_books = user_item.shape[1]
#rechecking the sparsity
sparsity=1.0-len(user_item)/float(user_item.shape[0]*n_books)
print ('The sparsity level of Book Crossing dataset is ' + str(sparsity*100) + ' %')
The sparsity level of Book Crossing dataset is 99.97133027522935 %
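Note that the cell above divides only the number of matrix rows by the matrix size; a stricter sparsity figure would count the nonzero ratings themselves, e.g.:

```python
# Sparsity counting the actual nonzero ratings rather than just the rows
nonzero = np.count_nonzero(user_item.values)
strict_sparsity = 1.0 - nonzero / float(n_users * n_books)
print('Strict sparsity: {:.4f} %'.format(strict_sparsity * 100))
```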
k = 10
metric = 'cosine'
rating_explicit
| | id | ISBN | rating | Avg_Rating | nbr_ratings_per_book |
|---|---|---|---|---|---|
| 911 | 277157 | 0312979517 | 5 | 7.000000 | 10 |
| 913 | 277157 | 0316154059 | 5 | 7.785714 | 14 |
| 917 | 277157 | 0345452550 | 7 | 7.764706 | 17 |
| 928 | 277157 | 0399146504 | 7 | 7.500000 | 8 |
| 936 | 277157 | 0399148639 | 6 | 7.944444 | 18 |
| ... | ... | ... | ... | ... | ... |
| 1149714 | 276688 | 0553575090 | 7 | 8.125000 | 8 |
| 1149715 | 276688 | 0553575104 | 6 | 6.500000 | 10 |
| 1149738 | 276688 | 0688156134 | 8 | 7.200000 | 10 |
| 1149743 | 276688 | 0836218655 | 10 | 8.153846 | 13 |
| 1149744 | 276688 | 0836236688 | 10 | 8.166667 | 12 |
62408 rows × 5 columns
User-based recommendation systems are a type of collaborative filtering method used to generate recommendations for users based on the preferences of similar users. The main idea behind user-based recommendation systems is to identify users who have similar preferences to a target user and recommend items that these similar users have liked.
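Before the full implementation below, here is a minimal sketch of the idea on a tiny made-up matrix: compute user-user cosine similarities, then look at what the nearest neighbour rated (all values are illustrative only).

```python
# Toy illustration of user-based collaborative filtering (made-up values)
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

toy = np.array([[5, 4, 0, 0],    # user 0
                [4, 5, 0, 1],    # user 1: tastes close to user 0
                [0, 0, 5, 4]])   # user 2: very different tastes
sim = cosine_similarity(toy)     # 3x3 user-user similarity matrix
nearest = sim[0].argsort()[-2]   # most similar user to user 0, excluding user 0 itself
print(nearest, toy[nearest])     # user 1; their liked items are candidate recommendations
```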
More preprocessing, if needed (kept commented out):
# Optional: keep only frequent users (> 40 ratings) and frequently rated books (> 10 ratings)
# print(rating_explicit.shape[0])
# count1=rating_explicit.id.value_counts()
# rating_explicit=rating_explicit[rating_explicit.id.isin(count1[count1>40].index)]
# count2=rating_explicit.ISBN.value_counts()
# rating_explicit=rating_explicit[rating_explicit.ISBN.isin(count2[count2>10].index)]
# rating_explicit
user_item = rating_explicit.pivot(index='id', columns='ISBN', values='rating')
user_item.fillna(0, inplace = True)
user_item = user_item.astype(np.int32)
from sklearn.neighbors import NearestNeighbors

def findksimilarusers(user_id, ratings, metric='cosine', k=k):
    # Fit a brute-force k-NN model on the user-item matrix
    model_knn = NearestNeighbors(metric=metric, algorithm='brute')
    model_knn.fit(ratings)
    loc = ratings.index.get_loc(user_id)
    distances, indices = model_knn.kneighbors(ratings.iloc[loc, :].values.reshape(1, -1), n_neighbors=k+1)
    similarities = 1 - distances.flatten()  # cosine similarity = 1 - cosine distance
    return similarities, indices
def predict_userbased(user_id, item_id, ratings, metric='cosine', k=1):
    user_loc = ratings.index.get_loc(user_id)
    item_loc = ratings.columns.get_loc(item_id)
    similarities, indices = findksimilarusers(user_id, ratings, metric, k)
    # Mean rating of the target user
    mean_rating = ratings.iloc[user_loc, :].mean()
    # Weighted sum of the similar users' ratings for the target item
    weighted_sum = 0
    if len(indices[0]) > 0:
        for i in range(len(indices[0])):
            if indices[0][i] != user_loc:
                # Check whether the i-th similar user rated the item
                if not np.isnan(ratings.iloc[indices[0][i], item_loc]):
                    similarity = similarities[i]
                    rating = ratings.iloc[indices[0][i], item_loc]
                    weighted_sum += similarity * rating
    # Predicted rating: the user's mean rating plus the normalized weighted sum
    # (subtracting 1 from the similarity sum removes the self-similarity)
    prediction = mean_rating + (weighted_sum / (np.sum(similarities) - 1))
    return prediction
predict=predict_userbased(274301,"0553280368",user_item)
predict
9.16227064220183
true=user_item.loc[274301,"0553280368"]
print("true value",true,"|| prediction",predict)
true value 10 || prediction 9.16227064220183
def recommendItem(user_id, rating, metric=metric):
    if user_id not in rating.index.values:
        print("user id must be in: ", rating.index.values)
        return None
    L = []
    for i in range(rating.shape[1]):
        if rating[rating.columns[i]][user_id] == 0:  # item not rated yet
            p = predict_userbased(user_id, rating.columns[i], rating, metric)
            if p is None:
                L.append(-1)  # not enough similar users
            else:
                L.append(p)
        else:
            L.append(-1)  # already rated items
    return sorted(L, reverse=True)
def prediction(nbr_recommendation=nbr_recommendation, ratings=rating_explicit):
    user_id = int(input("user_id : "))
    if user_id not in user_item.index.values:
        print("user id must be in: ", user_item.index.values)
    else:
        L = []
        for i in range(user_item.shape[1]):
            if user_item[user_item.columns[i]][user_id] != 0:  # predict for the books this user has rated
                p = predict_userbased(user_id, user_item.columns[i], user_item, metric)
                L.append((user_item.columns[i], p))
        L = pd.DataFrame(L).sort_values(by=[1], ascending=False)
        L.rename(columns={0: "ISBN", 1: "Score"}, inplace=True)
        L = L.reset_index(drop=True)
        return L[:nbr_recommendation]
display_popular_books(prediction)
user_id : 193458
| | ISBN | title | author | publisher | date | nombre de pages | description | image_links | Score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0064471047 | The Lion, the Witch and the Wardrobe (rack) | C. S. Lewis | HarperCollins | 1994 | 228 | They open a door and enter a world Narnia ... ... | http://images.amazon.com/images/P/0064471047.0... | 10.08945 |
| 1 | 0064471063 | The Horse and His Boy (rack) | C. S. Lewis | HarperCollins | 1994 | 256 | An orphaned boy and a kidnapped horse gallop f... | http://images.amazon.com/images/P/0064471063.0... | 10.08945 |
| 2 | 0064471071 | The Voyage of the Dawn Treader (rack) | C. S. Lewis | HarperCollins | 1994 | 292 | A voyage to the very ends of the world Narnia ... | http://images.amazon.com/images/P/0064471071.0... | 10.08945 |
| 3 | 006447108x | The Last Battle (rack) | C. S. Lewis | HarperCollins | 1994 | 240 | The last battle is the greatest of all battles... | http://images.amazon.com/images/P/006447108X.0... | 10.08945 |
| 4 | 0064471098 | The Silver Chair (rack) | C. S. Lewis | HarperCollins | 1994 | 272 | Narnia ... where giants wreak havoc ... where ... | http://images.amazon.com/images/P/0064471098.0... | 10.08945 |
Item-based recommendation systems are a type of collaborative filtering method used to generate recommendations for users based on their past interactions with items. The main idea behind item-based recommendation systems is to identify items that are similar to each other based on the users who have interacted with them and use this similarity to make recommendations.
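A similar toy sketch illustrates the item-based view: transpose the matrix so rows are items, and two items come out similar when the same users rated them alike (values are made up for illustration).

```python
# Toy illustration of item-based collaborative filtering (made-up values)
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

toy = np.array([[5, 4, 0, 0],
                [4, 5, 0, 1],
                [0, 0, 5, 4]])
item_sim = cosine_similarity(toy.T)   # 4x4 item-item similarity matrix
print(np.round(item_sim, 2))          # items 0 and 1 are close; items 2 and 3 are close
```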
def findksimilaritems(ratings=user_item, metric=metric, k=5):
    item_id = input(" item_id ")
    ratings = ratings.T  # transpose so that rows are items
    loc = ratings.index.get_loc(item_id)
    model_knn = NearestNeighbors(metric=metric, algorithm='brute')
    model_knn.fit(ratings)
    distances, indices = model_knn.kneighbors(ratings.iloc[loc, :].values.reshape(1, -1), n_neighbors=k+1)
    distances = distances[0]
    indices = indices[0]
    similarity = 1 - distances
    L = pd.DataFrame({"ISBN": indices, "Score": similarity})
    L.ISBN = L.ISBN.apply(lambda x: user_item.columns[x])
    return L[1:].reset_index(drop=True)  # drop the first neighbour (the item itself)
display_popular_books(findksimilaritems) # Harry Potter 2009 ISBN: 0439139597
item_id 0439139597
| | ISBN | title | author | publisher | date | nombre de pages | description | image_links | Score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0439136350 | Harry Potter | J. K. Rowling | Scholastic | 1999 | 435 | During his third year at Hogwarts School for w... | http://images.amazon.com/images/P/0439136350.0... | 0.640892 |
| 1 | 0439064864 | Harry Potter and the Chamber of Secrets | J. K. Rowling | Scholastic | 1999 | 364 | After 10 miserable years with his aunt and unc... | http://images.amazon.com/images/P/0439064864.0... | 0.618877 |
| 2 | 0590353403 | Harry Potter and the Sorcerer's Stone | J. K. Rowling | Arthur A. Levine Books | 1998 | 328 | Rescued from the outrageous neglect of his aun... | http://images.amazon.com/images/P/0590353403.0... | 0.462307 |
| 3 | 043935806x | Harry Potter and the Order of the Phoenix | J. K. Rowling | Scholastic | 2003 | 1232 | Collects the complete series that relates the ... | http://images.amazon.com/images/P/043935806X.0... | 0.444506 |
| 4 | 059035342x | Harry Potter and the Sorcerer's Stone | J. K. Rowling | Arthur A. Levine Books | 1998 | 324 | Rescued from the outrageous neglect of his aun... | http://images.amazon.com/images/P/059035342X.0... | 0.228176 |
We're going to use Surprise, a Python library that provides a set of algorithms for building and evaluating recommendation systems. It includes several matrix factorization methods, including NMF and Singular Value Decomposition (SVD), as well as neighborhood-based (k-NN) techniques.
Singular Value Decomposition (SVD) is a matrix factorization technique that decomposes a matrix into three smaller matrices. The goal of SVD is to find the low-rank approximation of a given matrix. The low-rank approximation is a matrix with a lower number of dimensions than the original matrix, but it still captures the essential features of the data.
In the context of recommendation systems, SVD can be used to factorize a user-item matrix into two low-rank matrices representing user preferences and item attributes, respectively. SVD is a popular method for recommendation systems because it is simple, efficient, and provides good results in practice.
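Here is a minimal sketch of the low-rank idea with plain numpy, on a made-up 4×4 matrix: keep the two largest singular values and reconstruct an approximation of the original ratings.

```python
# Toy low-rank approximation via SVD (values made up for illustration)
import numpy as np

R = np.array([[5., 3., 0., 1.],
              [4., 0., 0., 1.],
              [1., 1., 0., 5.],
              [0., 1., 5., 4.]])
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2                                           # keep the 2 largest singular values
R_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(R_approx, 2))                    # rank-2 approximation of R
```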
user_ratings_threshold = 8
filter_users = rating_explicit['id'].value_counts()
filter_users_list = filter_users[filter_users >= user_ratings_threshold].index.to_list()
df_ratings_top = rating_explicit[rating_explicit['id'].isin(filter_users_list)]
df_ratings_top = df_ratings_top.drop_duplicates(subset=['id', 'ISBN'])
user_item = df_ratings_top.pivot(index='id', columns='ISBN', values='rating')
user_item.fillna(0, inplace=True)
user_item = user_item.astype(np.int32)
!pip install scikit-surprise
# conda install -c conda-forge scikit-surprise # if conda
Requirement already satisfied: scikit-surprise in c:\users\sami\anaconda3\envs\dauphine\lib\site-packages (1.1.3)
Requirement already satisfied: joblib>=1.0.0 in c:\users\sami\anaconda3\envs\dauphine\lib\site-packages (from scikit-surprise) (1.2.0)
Requirement already satisfied: numpy>=1.17.3 in c:\users\sami\anaconda3\envs\dauphine\lib\site-packages (from scikit-surprise) (1.23.5)
Requirement already satisfied: scipy>=1.3.2 in c:\users\sami\anaconda3\envs\dauphine\lib\site-packages (from scikit-surprise) (1.10.0)
from surprise import Dataset, Reader
from surprise import SVD, NMF
from surprise.model_selection import cross_validate, train_test_split, GridSearchCV
def model_based_recom_svd(rating_explicit):
    # Load the data
    df = rating_explicit.copy()
    reader = Reader(rating_scale=(1, 10))
    data = Dataset.load_from_df(df[['id', 'ISBN', 'rating']], reader)
    # Parameter grid to search
    param_grid = {'n_factors': [50, 100, 150], 'n_epochs': [20, 30], 'lr_all': [0.002, 0.005], 'reg_all': [0.02, 0.04, 0.06]}
    # Grid search for the best hyperparameters
    gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
    gs.fit(data)
    trainset, testset = train_test_split(data, test_size=0.2)
    # Train the model with the best hyperparameters found
    best_model = SVD(n_factors=gs.best_params['rmse']['n_factors'],
                     n_epochs=gs.best_params['rmse']['n_epochs'],
                     lr_all=gs.best_params['rmse']['lr_all'],
                     reg_all=gs.best_params['rmse']['reg_all'])
    # Evaluate the model with cross-validation
    cross_validate(best_model, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
    # Fit on the train split before predicting on the held-out test set
    best_model.fit(trainset)
    predictions = best_model.test(testset)
    df_pred = pd.DataFrame(predictions, columns=['user_id', 'isbn', 'actual_rating', 'pred_rating', 'details'])
    df_pred = df_pred.assign(abs_err=abs(df_pred['pred_rating'] - df_pred['actual_rating'])).drop('details', axis=1)
    return df_pred
df_pred = model_based_recom_svd(rating_explicit)
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean Std
RMSE (testset) 1.5211 1.5033 1.5096 1.5131 1.5502 1.5195 0.0164
MAE (testset) 1.1619 1.1489 1.1574 1.1529 1.1817 1.1605 0.0115
Fit time 0.90 0.91 0.91 1.50 1.65 1.17 0.33
Test time 0.13 0.15 0.25 0.27 0.40 0.24 0.10
def model_based_recom_test(df_pred, user_id, ratings, nbr_recommendation=5):
    # (the `ratings` argument is kept for call compatibility; the function
    # works from the global `book` and `rating` tables)
    df_book = book.copy()
    df_rating = rating.copy()
    # Harmonise the column names across the dataframes
    df_book = df_book.rename(columns={'ISBN': 'isbn'})
    df_rating = df_rating.rename(columns={'ISBN': 'isbn'})
    df_pred.rename(columns={'user_id': 'id', 'pred_rating': 'Score'}, inplace=True)
    # Attach book titles to the ratings
    df_merged = pd.merge(df_rating, df_book[['isbn', 'title']], on='isbn', how='left')
    # Attach the model predictions on the 'isbn' and 'id' columns
    df_ext = pd.merge(df_pred[['isbn', 'id', 'Score']], df_merged, on=['isbn', 'id'], how='left')
    # Keep only the rows for this user that actually received a prediction
    df_user = df_ext[(df_ext['id'] == user_id) & (df_ext['Score'].notna())]
    # Books the user actually rated 9 or 10
    top_rated_books = df_merged[(df_merged['id'] == user_id) & (df_merged['rating'] >= 9)]
    # Sort the predictions by score in descending order
    top_pred_books = df_user.sort_values('Score', ascending=False)
    # Pool predicted and actually top-rated books, keeping 'isbn' and 'Score'
    recommended_books = pd.concat([top_pred_books, top_rated_books])[['isbn', 'Score']]
    return recommended_books.sort_values('Score', ascending=False)[:nbr_recommendation][['isbn', 'Score']].reset_index(drop=True).rename(columns={'isbn': 'ISBN'})
model_based_recom_test(df_pred=df_pred, user_id=193458, ratings=rating_explicit)
| ISBN | Score | |
|---|---|---|
| 0 | 1853260002 | 9.159259 |
| 1 | 0064471098 | 8.348336 |
| 2 | 0451169530 | 8.279146 |
| 3 | 0399145923 | 8.133062 |
| 4 | 0140298479 | 8.083134 |
display_popular_books(model_based_recom_test(df_pred=df_pred, user_id=193458, ratings=rating_explicit))
| ISBN | title | author | publisher | date | nombre de pages | description | image_links | Score | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1853260002 | Pride and Prejudice | Jane Austen | Wordsworth Editions | 1992 | 292 | Elizabeth Bennet's early determination to disl... | http://images.amazon.com/images/P/1853260002.0... | 9.159259 |
| 1 | 0064471098 | The Silver Chair (rack) | C. S. Lewis | HarperCollins | 1994 | 272 | Narnia ... where giants wreak havoc ... where ... | http://images.amazon.com/images/P/0064471098.0... | 8.348336 |
| 2 | 0451169530 | The Stand | Stephen King | Signet Book | 1991 | 1172 | Horrific disaster as a plague virus sweeps the... | http://images.amazon.com/images/P/0451169530.0... | 8.279146 |
| 3 | 0399145923 | Carolina Moon | Nora Roberts | Putnam Adult | 2000 | 456 | Illustrated poems written in the sytle of Edwa... | http://images.amazon.com/images/P/0399145923.0... | 8.133062 |
| 4 | 0140298479 | Bridget Jones: The Edge of Reason | Helen Fielding | National Geographic Books | 2001 | 0 | With another devastatingly hilarious, ridiculo... | http://images.amazon.com/images/P/0140298479.0... | 8.083134 |
Non-negative matrix factorization (NMF) is a matrix decomposition technique used for purposes such as feature extraction, data compression, and recommendation. Like SVD, it factorizes the user-item matrix into two low-rank matrices representing user preferences and item attributes; unlike SVD, both factor matrices are constrained to be non-negative, which tends to make the latent factors easier to interpret.
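In its unbiased form (Surprise's default), the prediction reduces to

$$\hat{r}_{ui} = q_i^\top p_u, \qquad p_u \ge 0, \; q_i \ge 0,$$

and the factors are kept non-negative during SGD via a multiplicative update rule, which is why the regularization terms (`reg_pu`, `reg_qi`) matter for stability.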
def model_based_recom_nmf(rating_explicit):
    df = rating_explicit.copy()
    reader = Reader(rating_scale=(1, 10))
    data = Dataset.load_from_df(df[['id', 'ISBN', 'rating']], reader)
    # Parameter grid to search (note: the lr_bu/lr_bi/reg_bu/reg_bi entries
    # only take effect when NMF is run with biased=True)
    param_grid = {'n_factors': [50, 100],
                  'n_epochs': [20],
                  'lr_bu': [0.002, 0.005],
                  'lr_bi': [0.002, 0.005],
                  'reg_bu': [0.02, 0.06],
                  'reg_bi': [0.02, 0.06],
                  'reg_pu': [0.02, 0.06],
                  'reg_qi': [0.02, 0.06]}
    # Grid search for the best hyperparameters
    gs = GridSearchCV(NMF, param_grid, measures=['rmse', 'mae'], cv=3)
    gs.fit(data)
    # Build the model with the best hyperparameters found
    best_model = NMF(n_factors=gs.best_params['rmse']['n_factors'],
                     n_epochs=gs.best_params['rmse']['n_epochs'],
                     lr_bu=gs.best_params['rmse']['lr_bu'],
                     lr_bi=gs.best_params['rmse']['lr_bi'],
                     reg_bu=gs.best_params['rmse']['reg_bu'],
                     reg_bi=gs.best_params['rmse']['reg_bi'],
                     reg_pu=gs.best_params['rmse']['reg_pu'],
                     reg_qi=gs.best_params['rmse']['reg_qi'])
    # Evaluate the model with cross-validation
    cross_validate(best_model, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
    trainset, testset = train_test_split(data, test_size=0.2)
    best_model.fit(trainset)
    predictions = best_model.test(testset)
    df_pred = pd.DataFrame(predictions, columns=['user_id', 'isbn', 'actual_rating', 'pred_rating', 'details'])
    df_pred = df_pred.assign(abs_err=abs(df_pred['pred_rating'] - df_pred['actual_rating'])).drop('details', axis=1)
    return df_pred
df_pred_nmf = model_based_recom_nmf(rating_explicit)
Evaluating RMSE, MAE of algorithm NMF on 5 split(s).
Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean Std
RMSE (testset) 2.3927 2.3679 2.3548 2.3914 2.3620 2.3738 0.0155
MAE (testset) 1.8263 1.8001 1.7840 1.8191 1.7955 1.8050 0.0155
Fit time 3.07 1.75 1.82 1.60 1.46 1.94 0.58
Test time 0.23 0.13 0.14 0.12 0.12 0.15 0.04
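With a mean test RMSE of roughly 2.37 against about 1.52 for SVD on the same data, NMF is clearly the weaker of the two factorization models here; as the table below shows, its top predictions also saturate at the ceiling of the 10-point scale.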
model_based_recom_test(df_pred=df_pred_nmf, user_id=193458, ratings=rating_explicit)
| ISBN | Score | |
|---|---|---|
| 0 | 0449006522 | 10.0 |
| 1 | 0345361792 | 10.0 |
| 2 | 006447108x | 10.0 |
| 3 | 0142001740 | 10.0 |
| 4 | 0671880314 | 10.0 |
display_popular_books(model_based_recom_test(df_pred=df_pred_nmf, user_id=193458, ratings=rating_explicit))
| ISBN | title | author | publisher | date | nombre de pages | description | image_links | Score | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0449006522 | The Manhattan Hunt Club | John Saul | National Geographic Books | 2002 | 0 | Falsely convicted of a brutal crime, college s... | http://images.amazon.com/images/P/0449006522.0... | 10.0 |
| 1 | 0345361792 | A Prayer for Owen Meany | John Irving | William Morrow | 1989 | 648 | A story of friendship through adversity, faith... | http://images.amazon.com/images/P/0345361792.0... | 10.0 |
| 2 | 006447108x | The Last Battle (rack) | C. S. Lewis | HarperCollins | 1994 | 240 | The last battle is the greatest of all battles... | http://images.amazon.com/images/P/006447108X.0... | 10.0 |
| 3 | 0142001740 | The Secret Life of Bees | Sue Monk Kidd | Penguin Books | 2003 | 340 | The multi-million bestselling novel about a yo... | http://images.amazon.com/images/P/0142001740.0... | 10.0 |
| 4 | 0671880314 | Schindler's List | Thomas Keneally | Simon and Schuster | 1993 | 404 | Historical fiction about a man and the Jews he... | http://images.amazon.com/images/P/0671880314.0... | 10.0 |
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
# Keep only strongly positive ratings (> 7): we cluster users on books they liked
rating_explicit = rating_explicit[rating_explicit['rating'] > 7]
rating_explicit.shape[0]
40797
The elbow method is a heuristic for choosing the number of clusters in a KMeans clustering algorithm.
It plots the within-cluster sum of squares (WCSS) as a function of the number of clusters; the "elbow", i.e. the point where the curve stops dropping sharply, is taken as the optimal number of clusters.
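For reference, the quantity plotted is

$$\mathrm{WCSS}(k) = \sum_{c=1}^{k} \sum_{x \in C_c} \lVert x - \mu_c \rVert^2,$$

where $\mu_c$ is the centroid of cluster $C_c$. This is the `inertia_` attribute that scikit-learn's KMeans exposes, which the helper below collects for each candidate $k$.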
def elbow_method(df, k):
wcss = [] # Within-Cluster-Sum-of-Squares
for i in tqdm(range(1, k)):
kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeans.fit(df)
wcss.append(kmeans.inertia_)
# Plot the elbow graph
plt.plot(range(1, k), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
elbow_method(user_item, k=20)
100%|██████████████████████████████████████████| 19/19 [01:46<00:00, 5.63s/it]
The WCSS curve decreases smoothly without a sharp elbow, so a relatively large number of clusters is needed; we choose $k=20$.
from sklearn.model_selection import train_test_split
def kmeans(df, k=20, reduce=False):
    # Build the user-item matrix
    user_item = df.pivot(index='id', columns='ISBN', values='rating')
    user_item.fillna(0, inplace=True)
    user_item = user_item.astype(np.int32)
    if reduce:
        user_item_scaled = reduced_user_item(user_item)
    else:
        # Standardize the matrix so distances are comparable across books
        user_item_scaled = (user_item - user_item.mean()) / user_item.std()
    # Split the data
    train, test = train_test_split(df, test_size=0.2, random_state=42)
    test = test.copy()
    # Fit k-means on the training users
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0).fit(user_item_scaled.loc[train['id']])
    # Predict clusters for the test users
    test_clusters = kmeans.predict(user_item_scaled.loc[test['id']])
    # Map each test user id to its predicted cluster
    user_cluster_dict = dict(zip(user_item_scaled.loc[test['id']].index, test_clusters))
    # Add a 'cluster' column to the test ratings
    test['cluster'] = test['id'].map(user_cluster_dict)
    return test
### Recommendations for a given user_id
def make_recommendations_kmeans(user_id, df, n=5):
    # Standardize the column name without mutating the caller's dataframe
    df = df.rename(columns={'rating': 'Score'})
    # Get the cluster of the user
    user_cluster = df.loc[df['id'] == user_id]['cluster'].iloc[0]
    # Mean rating of each book among users in the same cluster
    item_ratings = df.loc[df['cluster'] == user_cluster].groupby('ISBN')['Score'].mean().reset_index()
    # Sort the books by mean rating in descending order
    item_ratings = item_ratings.sort_values(by='Score', ascending=False)
    # Keep the top n books
    top_items = item_ratings.head(n).reset_index(drop=True)
    # Return the top n books as recommendations for the user
    return top_items
rating_cluster = kmeans(df=rating_explicit)
# Display the top 5 recommendations
make_recommendations_kmeans(user_id=277427, df=rating_cluster)
| ISBN | Score | |
|---|---|---|
| 0 | 1931561648 | 10.0 |
| 1 | 0743411269 | 10.0 |
| 2 | 0345447840 | 10.0 |
| 3 | 0345450884 | 10.0 |
| 4 | 0553258559 | 10.0 |
rating_cluster.cluster.value_counts()
user_id_1 = 254
score1 = make_recommendations_kmeans(user_id=user_id_1, df=rating_cluster)
score1
| ISBN | Score | |
|---|---|---|
| 0 | 0060090375 | 10.0 |
| 1 | 0375701214 | 10.0 |
| 2 | 0446611867 | 10.0 |
| 3 | 0446611085 | 10.0 |
| 4 | 0446610038 | 10.0 |
display_popular_books(score1,df=rating_explicit)
| ISBN | title | author | publisher | date | nombre de pages | description | image_links | Score | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0060090375 | Lucy Sullivan Is Getting Married | Marian Keyes | William Morrow Paperbacks | 2002 | 628.0 | What happens when a psychic tells Lucy that sh... | http://images.amazon.com/images/P/0060090375.0... | 10.0 |
| 1 | 0375701214 | The Diving Bell and the Butterfly | Jean-Dominique Bauby | National Geographic Books | 1998 | 0.0 | A triumphant memoir by the former editor-in-ch... | http://images.amazon.com/images/P/0375701214.0... | 10.0 |
| 2 | 0446611867 | A Bend in the Road | Nicholas Sparks | Warner Books | 2002 | NaN | None | http://images.amazon.com/images/P/0446611867.0... | 10.0 |
| 3 | 0446611085 | Suzanne's Diary for Nicholas | James Patterson | Vision | 2003 | 308.0 | Katie Wilkinson has finally found the perfect ... | http://images.amazon.com/images/P/0446611085.0... | 10.0 |
| 4 | 0446610038 | 1st to Die | James Patterson | Grand Central Publishing | 2002 | 492.0 | Four crime-solving friends face off against a ... | http://images.amazon.com/images/P/0446610038.0... | 10.0 |
!pip install keras
Requirement already satisfied: keras in c:\users\sami\anaconda3\envs\dauphine\lib\site-packages (2.11.0)
!pip install tensorflow
Requirement already satisfied: tensorflow in c:\users\sami\anaconda3\envs\dauphine\lib\site-packages (2.11.0)
The model used for dimensionality reduction is an autoencoder: a neural network trained to reconstruct its own input through a narrow hidden layer, whose hidden representation serves as the compressed encoding.
from keras.layers import Input, Dense
from keras.models import Model
def reduced_user_item(user_item, encoding_dim=20):
    n_features = user_item.shape[1]
    X = user_item
    # Input layer: one unit per book
    input_layer = Input(shape=(n_features,))
    # Encoder: compress each user's rating vector to `encoding_dim` dimensions
    encoder_layers = Dense(encoding_dim, activation='relu')(input_layer)
    # Decoder: reconstruct the original rating vector
    decoder_layers = Dense(n_features, activation='sigmoid')(encoder_layers)
    # Autoencoder: trained to reproduce its own input
    autoencoder = Model(inputs=input_layer, outputs=decoder_layers)
    # Compile the model
    autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
    # Fit the model to the data
    autoencoder.fit(X, X, epochs=2, batch_size=32)
    # Get the weights of the first Dense layer (the encoder)
    first_layer_weights = autoencoder.layers[1].get_weights()[0]
    # Project the user_item matrix into the lower-dimensional space
    X_reduced = X.dot(first_layer_weights)
    # Build the reduced dataframe, keeping the user ids as index
    user_reduced_item = pd.DataFrame(data=X_reduced, index=user_item.index)
    return user_reduced_item
Once the autoencoder is trained, the weights of the first Dense layer (the encoder) are extracted and used to project the original user_item matrix into a lower-dimensional space, yielding the reduced matrix user_reduced_item.
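Note that the projection X.dot(first_layer_weights) is linear: it drops the encoder's bias and ReLU activation. A hypothetical variant, sketched below and not what this notebook uses, would apply the full trained encoder through a Keras sub-model:

def reduced_user_item_encoder(user_item, encoding_dim=20):
    # Sketch: same architecture as reduced_user_item above, but the reduction
    # applies the complete encoder (weights + bias + ReLU) via a sub-model.
    n_features = user_item.shape[1]
    input_layer = Input(shape=(n_features,))
    encoded = Dense(encoding_dim, activation='relu')(input_layer)
    decoded = Dense(n_features, activation='sigmoid')(encoded)
    autoencoder = Model(inputs=input_layer, outputs=decoded)
    autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
    autoencoder.fit(user_item, user_item, epochs=2, batch_size=32)
    # The encoder sub-model shares the layers, hence the trained weights
    encoder = Model(inputs=input_layer, outputs=encoded)
    return pd.DataFrame(encoder.predict(user_item), index=user_item.index)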
# False
kmeans(rating_explicit,reduce=False).cluster.value_counts()
17    7408
2      157
1       90
9       74
4       50
19      46
3       44
18      41
13      36
14      34
5       30
6       29
16      25
12      22
15      21
7       15
10      14
11       9
0        9
8        6
Name: cluster, dtype: int64
# True
kmeans(rating_explicit, reduce=True).cluster.value_counts()
Epoch 1/2
91/91 [==============================] - 2s 7ms/step - loss: 0.5217
Epoch 2/2
91/91 [==============================] - 1s 7ms/step - loss: 0.3160
0     1681
8     1348
6     1036
12     918
13     884
4      851
2      378
18     277
1      157
9      106
10      90
19      76
15      74
7       71
3       57
17      46
5       44
16      30
14      22
11      14
Name: cluster, dtype: int64
Dimensionality reduction clearly produces more evenly distributed clusters: without it, a single cluster absorbs over 90% of the test users (7408 of 8160), whereas with the autoencoder the largest cluster holds about a fifth of them (1681 of 8160).
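To make "more evenly distributed" quantitative, one could score cluster balance with the normalized entropy of the cluster sizes; this helper is an illustration, not part of the original analysis.

# Sketch: normalized entropy of cluster sizes as a balance score
# (1.0 = perfectly even clusters, near 0 = one dominant cluster).
# `counts` is a value_counts() Series like the outputs shown above.
def cluster_balance(counts):
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum() / np.log(len(p)))
# e.g. cluster_balance(kmeans(rating_explicit, reduce=True).cluster.value_counts())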
# Compute the rating_cluster dataframe
rating_cluster = kmeans(df=rating_explicit, reduce=True)
Epoch 1/2
91/91 [==============================] - 1s 7ms/step - loss: 0.5004
Epoch 2/2
91/91 [==============================] - 1s 7ms/step - loss: 0.3163
rating_cluster
| id | ISBN | rating | Avg_Rating | nbr_ratings_per_book | cluster | |
|---|---|---|---|---|---|---|
| 348592 | 83287 | 0440210690 | 9 | 7.647059 | 17 | 17 |
| 516197 | 125039 | 0451456718 | 9 | 8.181818 | 11 | 7 |
| 171803 | 37311 | 0440225701 | 8 | 7.704918 | 61 | 11 |
| 574926 | 138379 | 0375700757 | 9 | 7.705882 | 51 | 18 |
| 366995 | 88122 | 0425116840 | 9 | 7.640000 | 25 | 7 |
| ... | ... | ... | ... | ... | ... | ... |
| 825037 | 199209 | 0451452615 | 10 | 7.111111 | 9 | 11 |
| 203879 | 46398 | 0517542099 | 10 | 9.125000 | 8 | 12 |
| 442472 | 105979 | 0452268060 | 8 | 7.450000 | 20 | 17 |
| 462662 | 110973 | 0345409973 | 8 | 8.777778 | 9 | 7 |
| 839227 | 203240 | 038079487x | 8 | 7.660000 | 50 | 17 |
8160 rows × 6 columns
# Display the top 5 recommendations
display_popular_books(make_recommendations_kmeans(user_id=254, df=rating_cluster),df=rating_explicit)
| ISBN | title | author | publisher | date | nombre de pages | description | image_links | Score | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0060090375 | Lucy Sullivan Is Getting Married | Marian Keyes | William Morrow Paperbacks | 2002 | 628.0 | What happens when a psychic tells Lucy that sh... | http://images.amazon.com/images/P/0060090375.0... | 10.0 |
| 1 | 0375701214 | The Diving Bell and the Butterfly | Jean-Dominique Bauby | National Geographic Books | 1998 | 0.0 | A triumphant memoir by the former editor-in-ch... | http://images.amazon.com/images/P/0375701214.0... | 10.0 |
| 2 | 0446611867 | A Bend in the Road | Nicholas Sparks | Warner Books | 2002 | NaN | None | http://images.amazon.com/images/P/0446611867.0... | 10.0 |
| 3 | 0446611085 | Suzanne's Diary for Nicholas | James Patterson | Vision | 2003 | 308.0 | Katie Wilkinson has finally found the perfect ... | http://images.amazon.com/images/P/0446611085.0... | 10.0 |
| 4 | 0446610038 | 1st to Die | James Patterson | Grand Central Publishing | 2002 | 492.0 | Four crime-solving friends face off against a ... | http://images.amazon.com/images/P/0446610038.0... | 10.0 |
Conclusion:
Building this collaborative filtering book recommendation system required careful data cleaning and analysis.
We compared model-based matrix factorization (SVD and NMF via Surprise) with user clustering (KMeans, optionally on an autoencoder-reduced user-item matrix): SVD achieved the best cross-validated error (RMSE ≈ 1.52 against ≈ 2.37 for NMF), while dimensionality reduction yielded noticeably more balanced clusters. We are confident that such a system can help users discover new books and enhance their reading experience.
We will continue to improve and fine-tune these models to provide an even more personalized recommendation experience to our users.